ABSTRACT

Investigations into item bias provide an empirical
basis for the identification and elimination of test items which
appear to measure different traits across populations or cultural
groups. The psychometric rationales for six approaches to the
identification of biased test items are reviewed: (1) Transformed
item difficulties: within-group p-values are standardized and
compared between groups. (2) Analysis of variance: bias is
operationally defined in terms of significant item by group
interaction effects. (3) Chi-square: individual items are
investigated in terms of between group score level differences in
expected and observed proportions of correct responses. (4) Item
characteristic curve theory: differences in the probabilities of a
correct response, given examinees of the same underlying ability and
different culture groups, are evaluated. (5) Factor analytic: item
bias is investigated in terms of culture specific and culture common
sources of variance, or in terms of loadings on a biased test factor.
(6) Distractor response analysis: the relative attractiveness of item
foils, or response sets, is investigated. The limitations and
advantages of each approach in terms of the underlying assumptions,
psychometric soundness, conceptual complexity, applicability to
criterion referenced tests and applicability to interdependent groups
are discussed. (Author/HV)
ABSTRACT

Different methodologies have been proposed for the evaluation of bias
both in selection and assessment instruments and in the items within such
measures. While bias in an instrument as a whole is of prime concern to test
users and has received considerable attention in popular and professional
literature, bias in test items is of increasing concern to test developers.
Investigations into item bias provide an empirical basis for the
identification and elimination of items which appear to measure different
traits across population/culture groups. Thus, they help to decrease bias in
instruments under development.

This paper reviews the psychometric rationales of the following six types
of approaches to biased item identification:

1. transformed item difficulties approaches in which within-group p-values
   are standardized and compared between groups.

2. analysis of variance approaches in which bias is operationally defined
   in terms of significant item by group interaction effects.

3. chi-square approaches in which individual items are investigated in
   terms of between group score level differences in expected and
   observed proportions of correct responses.

4. item characteristic curve theory approaches in which differences in
   the probabilities of a correct response, given examinees of the same
   underlying ability and different culture groups, are evaluated.

5. factor analytic approaches in which item bias is investigated in
   terms of culture specific and culture common sources of variance or
   in terms of loadings on a biased test factor.

6. distractor response analysis approaches in which the relative
   attractiveness of item foils is investigated.

Limitations and advantages of the approaches in terms of their underlying
assumptions, psychometric soundness, conceptual complexity, applicability to
criterion referenced tests and applicability to interdependent groups are
discussed and evaluated.
Efforts Toward the Development of
Unbiased Selection and Assessment Instruments*



Approximately 25 years ago Eells and his colleagues conducted what appears
to be the first serious attempt to examine test items for bias (Eells, et al.,
1951) and developed one of the first measures purported to be culture fair.
Since that time, the entire issue of cultural bias in measurement has become
heated, complex, and pronounced in the literature. Actions by the National
Association of Black Psychologists, the American Personnel and Guidance
Association, the National Education Association, the National Association for
the Advancement of Colored People, the National Association of Elementary
School Principals and the Council of the Society for the Psychological Study
of Social Issues calling for moratoria on certain types of tests, banning
tests, and requiring alternate plans for testing, indicate the serious nature
of the current situation (see Williams, Mosby and Hinsen, 1976). The concern
is also apparent in recent litigation (DeFunis vs. Odegaard, 1974; Diana vs.
the California State Board of Education, 1970; Hobson vs. Hansen, 1967).
Naturally, all this has not gone unnoticed by those involved in the
measurement field. Bias and debiasing studies have occurred and various models
have been proposed in ever-expanding efforts to meet the challenge of bias in
educational assessment.

One major type of bias investigation is concerned with the instrument
as a whole and examines the question: Does a test unduly favor or impede
examinees from different parts of the country, or of different backgrounds?
Another is concerned with the items within a test and asks: Which items and
item formats are appropriate for a given population and which may be used
across given cultures?

*The author is indebted to David Knight and to William Merz for their
valuable assistance with earlier drafts of this report, to Sonya Johnson and
Eileen Roper for their editorial assistance and to Jacqueline Cox and
Marianne Walker for typing this manuscript.

The first type of investigation is of interest to the test users who
need to know the accuracy of the test information. The models proposed by
Cleary (1968), Thorndike (1971), Darlington (1971), Cole (1973), Einhorn and
Bass (1971) and Gross and Su (1975) (also see the entire spring 1976 issue of
the Journal of Educational Measurement) exemplify this first type of
investigation. The second type of investigation is of interest to developers
as it assists them in developing valid and cross-culture fair items and
provides a framework for constructing better tests in subsequent efforts. The
work of Angoff (1972), Cardall and Coffman (1964), Green and Draper (1972),
Merz (1973, 1976a), Rudner (1977a), Scheuneman (1975) and Veale and Foreman
(1975, 1976) has been directed at this need. It is this second type of bias,
item bias, which the present paper addresses.



Bias and the Item Tryout Procedure

Test and item bias generally stem from two major sources, the human
element involved in test development and the procedures used to evaluate the
test and test items. The first source of bias stems from cultural differences
between test developers and some test users. That is, the cultural incongruity
between test developers and users may subtly manifest itself in items which
are insensitive to the experiences, morals, and thinking of particular
cultural groups. Efforts by test developers to include members of various
cultural groups in the development and review of items will help identify some
biased items (see Green, 1971; Fitzgibbons, 1971), but certainly not all.

The second source of bias comes into play when data from a population
sample are used to improve the effectiveness of a test (Green, 1972). This
procedure, which as Green points out, has not changed in 50 years (Cf. Ruch,
1929, Chapter [illegible]; Lord and Novick, 1974, Chapter 15), is basic to the
development of effective achievement tests. However, during the item-tryout,
the characteristics of the dominant group will tend to overshadow those of
minority groups. As a result, items which are most sensitive to the abilities,
cognitive styles, and knowledge of the dominant group are selected. Such items
may be biased against the examinees whose attributes diverge from those of the
collective item-tryout sample.

The development of a standardized measure typically involves the
administration of a carefully developed item pool to a large representative
sample of examinees whose attributes are similar to those of the intended
population of examinees. Typically, a measure of each item's discrimination
power, e.g., the item-test point biserial correlation, is computed and those
items discriminating best are retained. As the population of this country is
largely white middle class, the items most sensitive to white middle class
attributes are those which are most often retained.

Green (1972) examined a few questions about this procedure: Are different
items retained when different cultural groups compose the item-tryout sample?
Will scores differ using tests composed of uniquely retained items? Will test
reliabilities differ using the different tryout samples?

Using the different levels and subtests of the California Achievement
Test battery as item pools and different subgroups of the standardization
sample as item-tryout samples, Green computed separate sets of item-test point
biserial correlations. From each set of correlations the best half of the
items (those with the highest correlations) were noted and pairwise
comparisons made. Aberrant items were then defined as those items retained
based on one subgroup of a pair, and rejected based on the other.

If all the subgroups responded to the items in the same manner,
identical items would be retained. However, this did not occur. The overall
median proportion of identical items which were retained after comparing all
21 possible pairs of item-tryout samples was only .70, a relatively low
percentage. Clearly, different item-tryout samples from different cultural
backgrounds lead to the selection of different items.
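
The selection step described above can be sketched in a few lines. This is
only an illustration of the general idea, not Green's actual program: the
function names, the 0/1 response-matrix layout (one row per examinee, one
column per item), and the use of the uncorrected total score are all
assumptions made for the example.

```python
from statistics import mean, pstdev

def point_biserial(item, total):
    """Point-biserial correlation between a 0/1 item and total scores."""
    p = mean(item)
    sd = pstdev(total)
    if p in (0.0, 1.0) or sd == 0:
        return 0.0  # correlation undefined; treat the item as non-discriminating
    mean_pass = mean(t for u, t in zip(item, total) if u == 1)
    mean_fail = mean(t for u, t in zip(item, total) if u == 0)
    return (mean_pass - mean_fail) / sd * (p * (1 - p)) ** 0.5

def retained_items(responses):
    """Indices of the best half of the items by item-test point biserial."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    r = [point_biserial([row[g] for row in responses], totals)
         for g in range(n_items)]
    ranked = sorted(range(n_items), key=lambda g: r[g], reverse=True)
    return set(ranked[: n_items // 2])

def proportion_identical(sample_a, sample_b):
    """Share of retained items common to two hypothetical tryout samples."""
    kept_a, kept_b = retained_items(sample_a), retained_items(sample_b)
    return len(kept_a & kept_b) / len(kept_a)
```

Calling `proportion_identical` on each pair of tryout samples and taking the
median would give a figure comparable in spirit to the .70 reported above.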

This, in itself, is not disturbing. Since the point biserial correlation
is partly a function of item difficulty, one might expect a number of items to
be uniquely retained. However, suppose different items are retained for whites
and blacks, and blacks obtain dissimilar total scores using (1) the items
uniquely retained based on blacks and (2) the items uniquely retained based
on whites. This would be cause for alarm. In comparing such sets of scores,
Green found correlations ranging from a negative value (illegible in the
source) to +.82 with a median of about .5. Since the number of items in these
tests composed of uniquely retained items was less than the original item
pool, the reliabilities were correspondingly low. However, even after
correcting for attenuation (bringing the median correlation to about .8),
large amounts of variance in each set were still unaccounted for. These
different scores indicate that the unique items taken collectively may
measure ability differently across populations.

¹CTB/McGraw Hill (1974) and Ozenne, Van Gelder and Cohen (1974) have used
discordant point biserial correlations as a method of identifying biased items
in developing and restandardizing national achievement tests.

Green also computed the Kuder-Richardson reliabilities (KR-20) of the
item pools using different cultural groups. Differing reliabilities would
indicate that the scores of one cultural group contain more error than those
of another cultural group. The median KR-20 reliabilities in Green's study
were all .92 ± .02. Clearly, there was little evidence of bias by this
criterion. Perhaps this was because measures of internal consistency, such as
the KR-20, are largely sensitive to test length (Guilford, 1954, pp. 352-353).
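
For readers who want to reproduce this kind of check, the following is a
minimal sketch of the standard KR-20 formula applied to a 0/1 response matrix.
The function name and data layout are assumptions for illustration, and the
population variance of total scores is used.

```python
from statistics import mean, pvariance

def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    responses: one row per examinee, one 0/1 entry per item.
    """
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    if k < 2 or var_total == 0:
        raise ValueError("KR-20 needs at least two items and score variance")
    # Sum of item variances p * q over all items.
    sum_pq = 0.0
    for g in range(k):
        p = mean(row[g] for row in responses)
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)
```

Computing `kr20` separately for each cultural group's responses to the same
item pool gives the group-by-group reliabilities being compared here.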

In summation, Green showed that different items most probably will be
selected when different cultural groups are used as the item-tryout sample and
that scores obtained from these uniquely selected items will differ, even
though the item pools exhibit high degrees of internal consistency. The task,
then, is to modify the test development procedure so that items which are
unduly sensitive to cultural differences can be identified and either revised
or eliminated.

Approaches to Biased Item Identification

Recently, procedures and models have been proposed and advocated for
identifying biased items within a test: (1) analysis of variance approaches,
(2) transformed item difficulties (p-values) approaches, (3) chi-square
approaches, (4) item characteristic curve theory approaches, (5) factor
analytic approaches, and (6) distractor response analysis approaches. The
interested reader is referred to Green and Draper (1972) for an empirical
investigation using and comparing a few of the earlier approaches within the
second through fifth categories and to Merz (in preparation) and Rudner (in
preparation) for empirical investigations comparing some newer approaches.

Analysis of Variance Approaches

In the first type of approach, which defines bias as a significant item by
group interaction, subjects sampled from two or more populations are given a
common test and the resultant variations in item scores are analyzed by an
analysis of variance design. Variance could be attributed to differences in
(1) items, as some items are more difficult than others; (2) groups, as one
group may have more of the measured attribute than another; (3) subjects
within groups, as examinees will differ in ability; and (4) an interaction of
the items and the groups. When the groups are defined by cultural
affiliations, a significant item by culture interaction is indicative of some
items being relatively more difficult for members of one culture than another.
Post hoc testing procedures, such as Duncan's Multiple Range Test (Duncan,
1955, 1957), can be used to identify specific items showing bias in terms of
significant differences in relative item difficulty.



Examples of this approach are found in Cardall and Coffman (1964), Cleary
and Hilton (1968), Eagle and Harris (1969), Hoepfner and Strickland (1972),
and Jensen (1973). In order to use this approach properly, extremely large
sample sizes are required in order to control for variables such as IQ,
socio-economic status, parental education level, ethnicity, and attitudes.
However, this is true for all investigations into item and test bias.

Jensen (1973) reported two studies in which he attempted control by
matching subjects from different cultures on their mental age. In both studies
great reductions were found in the item by culture interaction after matching,
indicating that the procedure may be more sensitive to ability than to
cultural variations.

Jensen confirmed this in a second investigation. After using an analysis
of variance approach with white and black subjects (without matching mental
age), he conducted a second analysis of variance using two groups of
Caucasians whose score distributions closely matched those of the blacks and
whites in the first part of the study. The results of this pseudo-race
comparison closely matched those of the true race comparison, especially with
regard to the item by culture interaction. He concluded that "it would be
extremely difficult to make a case that the race by items interaction is
attributable to culture bias" (p. 17). Thus, Jensen claims that this procedure
may be sensitive to differences in ability rather than to cultural
differences.

Whether or not Jensen's claim is valid, two additional major problems
with this approach exist. First, the practical alpha level in the post hoc
analysis can become inflated as the number of items increases. Hence, one must
be aware that some items may be erroneously classified as biased. The second
and more serious problem arises from the underlying assumption that the total
scores are unbiased. Inasmuch as the identification of biased items may
contradict this assumption, the procedure poses some conceptual problems.

Transformed Item Difficulties Approaches

The transformed item difficulties approaches, providing for a visual
examination of item by group interaction effects, were probably first
described by Thurstone (1925) in connection with his method of absolute
scaling. Of the approaches, this method appears to be one of the best known.
It has been advocated and used frequently by Angoff (1972; and Ford, 1974; and
Modu, 1973) and others (Green and Draper, 1972; Jensen, 1973; Hicks, et al.,
1976; Strassberg-Rosenberg and Donlon, 1975; Echternacht, 1975; Rudner,
1977b). Further, the approach has appeared in at least one measurement
textbook (Anastasi, 1976, pp. 222-226).

    In this method, indices of item difficulty--i.e., p-values--
    are obtained for two different groups on a number of items. Each
    p-value is converted to a normal deviate and the pairs of normal
    deviates, one pair for each item, are plotted on a bivariate
    graph, each pair represented by a point on the graph (Angoff,
    1972, p. 1).

The plot will generally be in the form of an ellipse. A 45 degree line,
passing through the origin, provides a theoretical regression indicating the
absence of bias. Items greatly deviating from this line may be regarded as
exhibiting an item by group interaction. That is, relative to the other
items, deviant items are especially more difficult for members of one group
than the other. Assuming both groups received similar instructions, such
items would appear to represent different psychological meanings for the two
groups of examinees.

Since the intent is to make comparisons of between-group differences in
item difficulty, it is necessary to transform the proportion passing an item
to an index of item difficulty which constitutes at least an interval scale.
This is accomplished by expressing each item p-value in terms of within-group
deviations of a normal curve (see Guilford, 1954, pp. 418-419). Any linear
transformation of the item z-score will meet such a requirement. One such
transformation has been Delta values (4z + 13).

The distance of an item point to the line can be treated as a measure of
the degree of item bias. One can determine which items are "greatly deviating"
from the line by incorporating any of the traditional or nontraditional
methods of outlier or residual analysis. One method is to place confidence
limits on the line by using a multiple of the standard error of estimation. An
alternate approach, adopted by Strassberg-Rosenberg and Donlon (1975) and
Hicks, et al. (1976), involves computing the standard deviation of the
residuals and classifying as biased those items deviating by greater than 1.5
standard deviation units. Rudner (1977b) has employed a fixed item-regression
line distance of .75 z-score units.
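
The steps above can be sketched as follows. This is one possible reading, not
any cited author's procedure: the function names are hypothetical, the normal
deviate is taken from the proportion failing (so that harder items receive
larger values), the deviates are standardized across items within each group
so that the 45 degree line through the origin serves as the reference, and the
default cutoff simply borrows the .75 figure mentioned above.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def normal_deviates(p_values):
    """Convert proportions passing (strictly between 0 and 1) to z-scores."""
    nd = NormalDist()
    return [nd.inv_cdf(1.0 - p) for p in p_values]

def to_delta(p):
    """Delta value 4z + 13 for a single proportion passing."""
    return 4.0 * normal_deviates([p])[0] + 13.0

def standardize(values):
    """Center and scale a list of values within one group."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def flag_items(p_group_a, p_group_b, cutoff=0.75):
    """Signed distance of each item from the line y = x, plus flagged indices."""
    za = standardize(normal_deviates(p_group_a))
    zb = standardize(normal_deviates(p_group_b))
    # Perpendicular distance from the point (za, zb) to the 45 degree line.
    distances = [(b - a) / sqrt(2.0) for a, b in zip(za, zb)]
    flagged = [g for g, d in enumerate(distances) if abs(d) > cutoff]
    return distances, flagged
```

A positive distance marks an item that is relatively harder for the second
group. Replacing the fixed cutoff with 1.5 times the standard deviation of the
distances would approximate the alternative rule described above.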

An example of the approach is shown in Figure 1. The transformed p-values
have a correlation of approximately .90, making the plot relatively long and
flat. The solid line represents the main axis and the dotted lines represent
linear confidence limits. The item represented in the upper left, outside the
confidence interval, would be considered biased.

As a modification of this procedure, Green and Draper (1972, p. 16)
suggest that the "item-test biserial correlations might be incorporated . . .
so as to estimate the linear test score-item score regression whereby item
difficulties may be formed in a manner analogous to the way in which adjusted
means are formed in an Analysis of Covariance." Since by this procedure,
differential item discrimination indices and item difficulties would both
influence item locations on the regression plot, items which have proportional
p-values but disproportionate discrimination indices would have a greater
tendency to deviate from the main axis of the scatterplot and show up as
aberrant.

Figure 1: A hypothetical transformed item difficulties scatterplot

Chi-Square Approaches

A third approach to biased item analysis determines whether examinees of
the same ability level have the same probability of a correct response
regardless of cultural affiliation. This is accomplished by dividing the
tryout samples into groups based on their observed score and comparing the
proportions of students within each level responding correctly with a
chi-square test for independent observations (Scheuneman, 1975, 1976; Green
and Draper, 1972). An item is considered unbiased if, for all individuals in
the same total score interval, the proportion of correct response is the same
for both groups under consideration. A modified chi-square test determines the
probability that an item is unbiased by this definition.

Scheuneman (1976), in applying the approach to several sets of data,
advocates using four or five total score levels based on the score
distribution of the smaller sample (Green and Draper had used within-group
quintiles). As with the analysis of variance approach, the procedure requires
a large number of inference tests. Again, unbiased items may be misclassified
as biased because of inflated alpha levels. Further, the procedure assumes
total scores to be valid measures of ability and appears to be unduly
sensitive to differences in the total score distributions of the examined
samples.
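
A rough sketch of the computation for a single item follows. It is an
assumption-laden illustration rather than the published procedure: the
function name and input layout are hypothetical, expected counts are formed
from the pooled proportion correct at each score level, and only
correct-response cells are summed. No significance level is computed, since
that depends on the reference distribution adopted.

```python
def score_level_chi_square(counts):
    """Chi-square-type statistic for one item.

    counts: list of score levels; each level is a list with one
    (n_examinees, n_correct) tuple per cultural group.
    """
    chi_sq = 0.0
    for level in counts:
        n_total = sum(n for n, _ in level)
        correct_total = sum(c for _, c in level)
        if n_total == 0 or correct_total == 0:
            continue  # nothing to compare at this score level
        # Proportion correct expected for every group at this level.
        p_level = correct_total / n_total
        for n, observed in level:
            expected = n * p_level
            if expected > 0:
                chi_sq += (observed - expected) ** 2 / expected
    return chi_sq
```

When every group has the same proportion correct within each score level the
statistic is zero; larger values indicate departures from that pattern.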

Item Characteristic Curve Theory Approaches

Latent trait or item characteristic curve theory relates the probability
of a correct item response to a function of an examinee's underlying ability
level ($\theta_i$) and characteristic(s) of the item. While the various models
(Lord, 1952; Rasch, 1960; Birnbaum, 1968; Urry, 1970) differ in terms of the
number of item parameters considered, they all describe the item parameter(s)
independently of the examined sample. This attractive property has led to the
development of some interesting applications in test development, adaptive
testing and equating, and may prove useful in detecting item bias.

One general, cumulative logistic model formalized by Birnbaum uses
three item parameters: $a_g$, an item discrimination index; $b_g$, an item
difficulty index; and $c_g$, a pseudo guessing parameter. Using the notation
$P(u_g = 1 \mid \theta_i)$ to represent the probability of a correct response
to item g given an examinee of ability level $\theta_i$, Birnbaum's three
parameter model states that:

$$P(u_g = 1 \mid \theta_i) = c_g + (1 - c_g)\,\bigl[1 + \exp\bigl(-1.7\,a_g(\theta_i - b_g)\bigr)\bigr]^{-1}$$

This relationship between $\theta_i$ and $P(u_g = 1 \mid \theta_i)$ is
illustrated in Figure 2.



The probability of a correct response given a specific ability level
increases monotonically as true ability increases. For example, an examinee
with a high true ability, e.g. $\theta_j$, has a high probability of
responding correctly [$P(u_g = 1 \mid \theta_j) \to 1.0$]. Conversely, an
examinee of low true ability, e.g. $\theta_i$, has a low probability of
responding correctly, approaching the lower asymptote of the curve, $c_g$.

The inflection point of the curve, $b_g$, is referred to as the item
difficulty parameter in that it indicates the relative position of the curve
along the $\theta$ axis. The more the curve is positioned to the right, the
more ability is necessary for an examinee to have a good probability of a
correct response. The slope of the curve at $b_g$ helps define a third
parameter, $a_g$. This value, referred to as the discrimination parameter,
indicates the power of the item to separate examinees of close but unequal
levels of ability. Although the item parameters and $\theta$ are on a common
metric, these item parameters describe characteristics of the item
independently of the examinee group. Full explanations and development of this
and other mental measurement models can be found in Jensema (1972) and in Lord
and Novick (1974).
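
As a small illustration of the three parameter logistic model, the function
below evaluates the probability of a correct response. The function and
argument names are chosen for this example only, and the scaling constant 1.7
is exposed as a default argument.

```python
from math import exp

def p_correct(theta, a, b, c, scale=1.7):
    """Three parameter logistic probability of a correct response.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: pseudo guessing parameter (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + exp(-scale * a * (theta - b)))

# At theta equal to b the curve sits halfway between c and 1.
halfway = p_correct(theta=0.0, a=1.0, b=0.0, c=0.2)
```

Evaluating `p_correct` over a grid of ability values for fixed item parameters
traces out a curve of the kind described above, rising from the lower
asymptote toward 1.0.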

Latent trait theory has been used to identify biased items (Green and
Draper, 1972; Lord, in press; Rudner, 1977a). In an early study, Green and
Draper had used observed total scores as estimates of examinees' abilities,
$\theta_i$'s, and the proportions of examinees responding correctly at each
total score level as estimates of $P(u_g = 1 \mid \theta_i)$. Their procedure
called for plotting estimated icc's for each item separately for each culture
group and comparing the plots.

By this and other latent trait theory approaches, an item is unbiased if
examinees of the same ability level, but of different cultural affiliations,
have equal probabilities of responding correctly. That is, an item is unbiased
if the estimated icc's obtained from the various culture groups are identical.
As an example of a biased item, consider the two hypothetical curves shown in
Figure 3. These curves are based on responses by two different culture groups
to the same item. Total observed scores are used as estimates of $\theta_i$
and proportions of examinees responding correctly are used as estimates of
$P(u_g = 1 \mid \theta_i)$. The curves are not identical, since the location
parameters for the two curves are not equal. Such an item can be considered
biased in that often examinees of the same ability level, e.g. X = 58%, but
from different culture groups, do not have similar proportions of correct
responses.

While this approach is appealing, total observed scores are directly
incorporated and quantification of the degree of item bias is difficult (an
eyeballing procedure is used to identify a "very biased item").




Rather than using total observed scores as estimates of $\theta_i$ and
proportions as estimates for $P(u_g = 1 \mid \theta_i)$, more accurate values
can be obtained using one of the recent methods of parameterization (Urry,
1975; Wingersky and Lord, 1973). In parameterization, the metric used for the
$\theta$ scale is defined by the ability variance in the examined sample. In
order to compare parameters obtained from two different examinee groups, the
obtained values must be equated. Lord and Novick (1974, Chapter 16.11) and
Rudner (1977b) have shown that this can be accomplished by computing the
regressions of the parameter values based on one group of examinees on the
parameter values based on the other group of examinees. The equated icc's will
be identical when the restrictions of the model are met. That is, when the
measure:

(1) is unidimensional

(2) contains locally independent items

(3) has error-free parameter estimates.

Figure 3: Two hypothetical response distributions

Rudner (1977a) has refined the procedure used by Green and Draper to
identify biased items by incorporating equated icc parameter values. The area
between pairs of equated icc's is used to indicate the relative amount of
aberrance for each item and eyeballing of the equated icc's is employed to
provide additional information as to the nature of the aberrance. Lord (in
press) has employed an asymptotic significance test based on the summed
variance-covariance matrices of the equated $a_g$ and $b_g$ parameter
estimates to test for significant differences between pairs of equated icc's.
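
The area-between-curves idea can be sketched numerically as below. The
integration range, the midpoint rule, and the function names are assumptions
made for this illustration; the source does not specify how the area was
computed. Each item is described by a hypothetical tuple (a, b, c) of already
equated three parameter logistic estimates.

```python
from math import exp

def icc(theta, a, b, c, scale=1.7):
    """Three parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + exp(-scale * a * (theta - b)))

def area_between(params_1, params_2, lo=-3.0, hi=3.0, steps=600):
    """Approximate unsigned area between two icc's over [lo, hi]."""
    width = (hi - lo) / steps
    area = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width  # midpoint of the k-th slice
        gap = abs(icc(theta, *params_1) - icc(theta, *params_2))
        area += gap * width
    return area
```

Identical parameter sets give an area of zero, and ranking items by this area
orders them by relative aberrance. Plotting the two curves remains useful for
seeing where along the ability scale the groups diverge.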

Factor Analytic Approaches

In factor analysis, underlying factors (i.e., dimensions or traits) are
hypothesized and the correlations of each variable with the hypothesized
factors are computed. In an achievement test, each item is treated as a
variable. Such an analysis could be conducted twice using examinees from two
different cultural backgrounds. Ideally, the two separate groups of examinees
would yield similar sets of item-trait correlations (factor loadings).
Different sets of factor loadings would indicate that the two groups are not
responding to the items in the same manner. Such a test would be considered
biased in that it appears to measure a different trait across groups. The
items exhibiting the most bias would then be those with the largest
differences in factor loading.

A general model for this type of factor analysis is

$$y = \Lambda f + e$$

where $y$ is a vector of subject responses,

      $\Lambda$ is a matrix of factor loadings,

      $f$ is a vector of factor variables (locations), and

      $e$ is a vector of residual or error terms.

From $y$, values of $\Lambda$, $f$, and $e$ are determined.

Green and Draper (1972) and Green (1976) suggest an inter-group factor
analysis model based on the inter-battery factor analysis approach offered by
Tucker (1958). In this inter-group model, the item variance is partitioned
into: (1) factors common to each subgroup; (2) factors specific to subgroups;
and (3) residual or error variance. With this model one can determine
the proportion of item variance accounted for by a given subgroup. An item,
then, is unbiased when this proportion is small and biased if a large
proportion of variance is attributable to culture-specific sources.

Merz (1973, 1976a) developed an alternate approach which incorporates
factor scores and analysis of variance. In this approach, the item responses
for the groups are combined, factor analyzed, and factor scores for each
examinee on each factor computed. These factor scores are then subjected to an
analysis of variance, with group membership being the independent variable.
Where significant mean differences are found in factor scores, the factor is
classified as biased. Biased items are defined as those with high factor
loadings on a biased factor.

These approaches are appealing in that they deal with the underlying
latent traits (true abilities) of the examinees. Green and Draper's approach
is particularly appealing in that variance is partitioned into culture-specific
and culture-common sources. Merz's approach has an advantage in that variance
caused by factors such as socio-economic status, IQ, and sex can be partialled
out. However, these procedures are not without some conceptual as well as
practical limitations.

The first step in factor analysis is the computation of the inter-variable
correlation matrix. To obtain stable correlations, that is, to avoid
capitalization on chance, one needs a large number of subjects; the general
rule of thumb is at least ten subjects per variable, a figure often ignored in
practice.



Assuming a sufficient number of subjects, there is a question as to which
type of correlations to use. In analyzing items, one usually deals with
dichotomously scored variables and either the phi (product-moment) or the
tetrachoric correlation is employed. However, the phi correlation is limited
in that it is highly sensitive to item difficulties, and the tetrachoric
correlation, though it estimates what the value of the inter-item correlation
would be if the items were continuous variables, is notoriously unstable. In
fact, Nunnally (1967, p. 124) emphatically states that tetrachoric correlations
cannot be used in factor analysis.
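
The sensitivity of phi to item difficulties can be shown numerically. The
sketch below computes phi from a 2 x 2 table of counts and the largest phi
attainable for two items with given p-values. The counts and function names
are hypothetical and serve only as an illustration.

```python
from math import sqrt


def phi(a, b, c, d):
    """Phi for a 2 x 2 table of counts.

    a = both items right, b = only item 1 right,
    c = only item 2 right, d = both items wrong.
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def phi_max(p1, p2):
    """Largest phi attainable for two items with p-values p1 and p2."""
    lo, hi = sorted((p1, p2))
    return sqrt(lo * (1 - hi) / (hi * (1 - lo)))


# Hypothetical 100 examinees: everyone who passes the hard item (p = .20)
# also passes the easy item (p = .80), the strongest possible relationship.
perfect_order = phi(a=20, b=0, c=60, d=20)
ceiling = phi_max(0.20, 0.80)
```

Even though the two items in this example are as strongly related as their
difficulties allow, phi reaches only .25. Two items of equal difficulty, by
contrast, can reach a phi of 1.00.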

Regardless of which type of correlation is used, there are additional
problems. As Nunnally (1967) points out, "... for a group of variables to
clearly define a number of factors, there must be a wide range of correlations"
(p. 256). In correlating items, especially dichotomously-scored items, the
average correlation is typically low. Thus it usually is not possible to
obtain a clear factor structure when factor analyzing test items.

Finally, in factor analysis many decisions need to be made by the
researcher. Which procedure? How many factors to extract? Which rotational
scheme to use? Different decisions can lead to different results. Thus while
the factor analytic approaches are appealing, in practice they may be
difficult to apply.

Distractor Response Analysis

Some of the chi-square, item difficulty regression, item characteristic
curve theory, analysis of variance, and factor analysis approaches incorporate
total test scores either directly or indirectly. This can pose a problem when
the total scores do not represent accurately the abilities of the examinees,
as would be expected in a very biased test.

Veale and Foreman (1975, 1976) recommend investigating the distractor
response distribution for various cultural groups in an approach not dependent
upon this assumption. Should one group be overly attracted to a particular
distractor in comparison to a second group, there may be a biasing
characteristic of the item attracting them away from the correct response.
Bias is thus defined as characteristics of an item which cause a distortion in
the item p-value for a cultural group.

Consider the choice distribution illustrated in Table 1. Observed
frequencies appear in the cells and expected frequencies appear in the upper
right hand corner of each cell. A disproportionate number of members of
Group 2 were attracted to Distractor 1 (the response frequencies can be shown
to be disproportionate by the use of a chi-square test). It may be argued
that some characteristic of Distractor 1 caused a substantial number of members
of Group 2 to select this distractor over the correct alternative. Hence
some characteristics of the item may have caused a distortion in the group
p-value.

Table 1

To obtain a global picture of an item's behavior, Veale and Foreman
compute several statistics on each item. These include:

(1) a chi-square to test the hypothesis that the conditional probabilities
of individuals missing the item by selecting a particular distractor given
their cultural group (foil pull indices) are equal across cultural groups;

(2) Cramer's V as a measure of "cultural variation" to determine the
extent of departure from the hypothesis tested above;

(3) Goodman-Kruskal measures of index groups by distractor association;

(4) supplementary item statistics for each cultural group, including
z-tests for testing deviations from random guessing, p-values, point biserial
correlations, and chi-square tests for gauging deviations from uniform
distractor response distribution.
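
Statistics (1) and (2) can be sketched as follows for a group by distractor
table of observed frequencies. This is a minimal illustration using the
ordinary Pearson chi-square and the usual definition of Cramer's V; it is not
Veale and Foreman's own program, and the counts are hypothetical rather than
those of Table 1.

```python
from math import sqrt


def chi_square_and_v(table):
    """Pearson chi-square, degrees of freedom, and Cramer's V.

    `table` is an r x c list of observed frequencies: rows are cultural
    groups, columns are the distractors chosen by examinees who missed
    the item.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under equal attraction across groups.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    n_rows, n_cols = len(table), len(table[0])
    df = (n_rows - 1) * (n_cols - 1)
    v = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, df, v


# Hypothetical counts of wrong answers by group and distractor.
observed = [
    [30, 30, 30],  # Group 1: Distractors 1, 2, 3
    [60, 15, 15],  # Group 2: Distractors 1, 2, 3
]
chi2, df, v = chi_square_and_v(observed)
```

In this example Group 2 is drawn disproportionately to Distractor 1. The
chi-square of 20 on 2 degrees of freedom exceeds the .05 critical value of
5.99, and Cramer's V of about .33 describes the extent of the departure.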

These supplementary statistics help discriminate between desirable and
undesirable items. For example, an item may show large cultural variation among
the distractors and have highly different point biserial correlations between
cultural groups. Such an item would appear to work well with one group and
poorly with another. This information, coupled with the variance in the
distractor distributions, would probably lead either to elimination or revision
of the item.

While directly sensitive to bias in item distractors, this approach is
only indirectly sensitive to other sources of bias such as those in the item
stem, directions, or subject matter. If one suspects that item bias is most
often caused by bias in the distractors, this limitation is not a serious one.
Further, by supplementing this approach as Veale and Foreman suggest, it is
possible to obtain a holistic view of the behavior of the aggregate item and
its constituent distractors.

Like the earlier chi-square and analysis of variance approaches, distractor
response analysis requires a large number of inferential tests, and the
consequent probability of committing Type I errors must be realized.
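
The size of this problem can be gauged with a short calculation. The sketch
below assumes the tests are independent, which item-level tests on one
instrument generally are not, so the figure is only a rough guide. The
Bonferroni adjustment shown is a common remedy and is not one proposed in the
sources reviewed here.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one Type I error in n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha that holds the familywise rate at or below alpha."""
    return alpha / n_tests


# Hypothetical 50-item test with one inferential test per item at .05.
rate = familywise_error(0.05, 50)
per_item = bonferroni_alpha(0.05, 50)
```

Under these assumptions the chance of falsely flagging at least one item is
above .90, while the adjusted per-item alpha would be .001.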

Discussion and Summary

Several approaches toward the identification of biased items have been
presented with their rationale and apparent advantages and limitations.
Comments have also been made regarding the use of a large number of inferential
tests, the assumption of an unbiased total score, and the use of outlier
analysis (see Table 2). In practice, depending on the purpose of the study
and the initial item pool, these limitations may be inconsequential.

The practitioner must first delineate the purpose to which such approaches
are to be applied. One purpose is to debias an instrument during its
development. The degree of item bias (indicated by the magnitude of a residual,
area, factor loading, or F) can be considered along with professional judgments
of item difficulty indices, item discrimination indices, and factor loadings
to determine which items are to be retained and dropped. In such instances,
it is usually better to drop an item falsely suspected of being biased than to
retain a truly biased one. Here, the limitations caused by inflated alpha errors
may be moot in the chi-square, distractor response analysis, and analysis of
variance approaches.

On the other hand, these techniques can be used to identify trends in
biased items. That is, biased items can be pooled and attempts made to
identify salient characteristics (see Rudner, 1977b). In such instances, one
would want a more conservative identification procedure. The transformed item
difficulties and the item characteristic curve theory approaches are
well-suited for this in that the confidence band can be narrowed or widened as
desired.
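
How an adjustable band works for transformed item difficulties can be
sketched as follows. The sketch assumes the delta transformation (13 + 4z)
usually associated with Angoff's procedure and a band of fixed width around
the major axis of the two-group delta plot. The band width of 1.5, the
function names, and the p-values are illustrative assumptions only.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p):
    """Delta: normal deviate of the proportion wrong, rescaled to 13 + 4z."""
    return 13 + 4 * NormalDist().inv_cdf(1 - p)


def flag_outliers(p_group1, p_group2, band=1.5):
    """Flag items lying farther than `band` from the major axis.

    Returns (flags, distances); distances are perpendicular distances in
    delta units. Assumes the two sets of deltas covary (sxy is not zero).
    """
    x = [delta(p) for p in p_group1]
    y = [delta(p) for p in p_group2]
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Slope and intercept of the major (principal) axis of the scatter.
    slope = ((syy - sxx) + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    distances = [(slope * xi + intercept - yi) / sqrt(slope ** 2 + 1)
                 for xi, yi in zip(x, y)]
    return [abs(d) > band for d in distances], distances


# Hypothetical p-values: the last item is much harder for Group 2.
p_group1 = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5]
p_group2 = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.1]
flags, distances = flag_outliers(p_group1, p_group2, band=1.5)
```

With these values only the last item should fall outside the band. Narrowing
the band flags more items and widening it flags fewer, which is what makes the
procedure adjustable toward a more conservative identification.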
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Table 2 



Some Salient Characteristics of the Different Approaches
(Part I)

Characteristic            Analysis of          Transformed Item     Chi-Square           Item Characteristic
                          Variance             Difficulties                              Curve Theory
------------------------  -------------------  -------------------  -------------------  -------------------
Major literature          Cardall and          Angoff (1972)        Scheuneman           Rudner (1977a)
                          Coffman (1964)                            (1975, 1976)

Operational definition    Significant          Differential         Proportion of        Probability of a
                          analysis of          relative item        correct responses    correct response
                          variance item x      difficulty           to an item is        for a given true
                          group interaction                         unequal for          ability is unequal
                                                                    members of           for examinees from
                                                                    different groups     different groups
                                                                    within the same
                                                                    total score
                                                                    category

Dependence on total       Indirectly           Indirectly           Directly             No
score being valid

Computational ease        Difficult            Easy                 Easy                 Difficult

Ease of conceptual        Medium               Easy                 Easy                 Difficult
understanding by
laypeople

Applicability to          Low                  Low                  Low                  Low
criterion referenced
tests

Applicability to easy     Medium (medium)      Medium               [illegible]          Medium
(difficult) items

Applicability to more     High                 Low                  Medium¹              Low
than two independent
and/or interdependent
cultural groups

Applicability to          High (high)          High (high)          High (high)          High (high)
multiple choice items
(non-multiple choice
items)

¹ By appropriately defining specific group membership, e.g., black females, as
the independent variable.



Table 2

Some Salient Characteristics of the Different Approaches
(Part II)

Characteristic            Factor Analysis         Factor Score            Distractor Response
                                                                          Analysis
------------------------  ----------------------  ----------------------  ----------------------
Major literature          Green (1976); Green &   Merz (1973, 1976)       Veale & Foreman
                          Draper (1972)                                   (1975, 1976)

Operational definition    Large proportion of     High loading on a       Characteristic(s) of
                          item variance is        factor which yields     the item distorts
                          group specific          unequal group mean      group item p-values
                                                  factor scores

Dependence on total       Indirectly              Directly¹               No
score being valid

Ease of conceptual        Difficult               Difficult               Easy
understanding by
laypeople

Applicability to          Low                     Low                     High
criterion referenced
tests

Applicability to easy     Medium (medium)         Medium (medium)         Low (high)
(difficult) items

Applicability to more     Medium²                 High                    Medium
than two independent
and/or interdependent
cultural groups

Applicability to          High (high)             High (high)             High (no)
multiple choice items
(non-multiple choice
items)

¹ In computing factor scores.
² By appropriately defining specific group membership, e.g., black females, as
the independent variable.

Often the intended audience will be a deciding factor in determining
which approach to use. Here the distractor response analysis, chi-square, and
transformed item difficulties approaches have a distinct advantage since they
are computationally and conceptually easy and can be readily explained to the
layperson.

One may wish to develop a measure that is simultaneously unbiased for
three cultural groups, such as white, black, and Chinese Americans. An
extension of this involves interdependent culture groups such as male-female
and white-black comparisons. Such interactions and simultaneous comparisons
can be analyzed directly by either the analysis of variance or factor score
approaches. The chi-square, distractor response analysis, and factor analysis
approaches can be adapted readily for such an analysis by defining group
membership appropriately. Transformed item difficulties and icc theory
approaches can also be applied but only by using several pairwise comparisons.
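
Defining group membership appropriately can be as simple as crossing two
background variables into one composite label before tabulating responses.
The sketch below illustrates the idea; the record fields (`race`, `sex`,
`choice`) and the data are hypothetical.

```python
from collections import defaultdict


def tabulate_by_composite_group(examinees):
    """Count response choices within each race-by-sex composite group."""
    table = defaultdict(lambda: defaultdict(int))
    for person in examinees:
        # One composite label, e.g. "black female", serves as the
        # independent variable in place of two separate variables.
        label = f"{person['race']} {person['sex']}"
        table[label][person["choice"]] += 1
    return {label: dict(counts) for label, counts in table.items()}


# Hypothetical responses to a single item.
examinees = [
    {"race": "black", "sex": "female", "choice": "A"},
    {"race": "black", "sex": "female", "choice": "B"},
    {"race": "white", "sex": "male", "choice": "A"},
    {"race": "black", "sex": "female", "choice": "A"},
]
table = tabulate_by_composite_group(examinees)
```

The resulting group by choice counts can then be submitted to a chi-square or
distractor response analysis in the same way as counts for simple groups.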

One final consideration is applicability to criterion referenced tests.
Ideally, the items of such measures are designed to be sensitive to growth,
rather than to differences among students. Examinees who have not mastered an
objective are expected to respond erroneously while those who have met the
criterion level are expected to respond correctly. Thus one cannot expect the
large variance of total scores (occasionally coupled with a normality
assumption) required of all the approaches other than distractor response
analysis. Therefore, if one is interested in analyzing items in a true
criterion referenced test, distractor response analysis appears to be the only
alternative presently available.

In summation, there is no one approach which appears best suited for all
situations. Of the approaches, the distractor response analysis and chi-square
approaches are the most communicable, a distinct advantage in explaining a
debiasing investigation to the lay person. In actually pinpointing the source
of bias, distractor response analysis is particularly useful because it alone
identifies which response alternative is the cause of aberrance. In addition,
distractor response analysis is uniquely applicable to true criterion referenced
tests. In terms of statistical adequacy, the icc theory approach is appealing
in that it is a true score model making no assumption about the accuracy of
p-values or individual total scores.
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